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Abstract 

Background: Despite the passing of more than a year since the first outbreak of Severe Acute 
Respiratory Syndrome (SARS), efficient counter-measures are still few and many believe that 
reappearance of SARS, or a similar disease caused by a coronavirus, is not unlikely. For other virus 
families like the picornaviruses it is known that pathology is related to proteolytic cleavage of host 
proteins by viral proteinases. Furthermore, several studies indicate that virus proliferation can be 
arrested using specific proteinase inhibitors supporting the belief that proteinases are indeed 
important during infection. Prompted by this, we set out to analyse and predict cleavage by the 
coronavirus main proteinase using computational methods. 

Results: We retrieved sequence data on seven fully sequenced coronaviruses and identified the 
main 3CL proteinase cleavage sites in polyproteins using alignments. A neural network was trained 
to recognise the cleavage sites in the genomes obtaining a sensitivity of 87.0% and a specificity of 
99.0%. Several proteins known to be cleaved by other viruses were submitted to prediction as well 
as proteins suspected relevant in coronavirus pathology. Cleavage sites were predicted in proteins 
such as the cystic fibrosis transmembrane conductance regulator (CFTR), transcription factors 
CREB-RP and OCT-1, and components of the ubiquitin pathway. 

Conclusions: Our prediction method NetCorona predicts coronavirus cleavage sites with high 
specificity and several potential cleavage candidates were identified which might be important to 
elucidate coronavirus pathology. Furthermore, the method might assist in design of proteinase 
inhibitors for treatment of SARS and possible future diseases caused by coronaviruses. It is made 
available for public use at our website: http://www.cbs.dtu.dk/services/NetCorona/ . 


Background 

In the spring of 2003, the Severe Acute Respiratory Syn¬ 
drome (SARS) caused numerous fatalities particularly in 
Southeast Asia and gravely affected the global economy. 
The causative agent was shown to be a human coronavirus 
[i]- a virus type which normally causes mild cold symp¬ 
toms in humans. The abrupt appearance raises concern of 


another break-out of an epidemic of SARS virus or similar 
strains in the future. 

Coronaviruses are found in different species ranging from 
chicken to cattle and humans. Currently, seven coronavi¬ 
rus genomes, including SARS coronavirus (CoV), have 
been fully sequenced and cluster into four main groups, of 
which SARS-CoV occupies its own [2,3]. Polyproteins 
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encoded by the coronavirus RNA are processed by viral 
proteinases yielding mature proteins. The main protein¬ 
ase 3 CLP™ performs at least eleven proteolytic cleavages 
within a single viral polyprotein [4,5]. Viral polyprotein 
processing is a common theme in viral molecular biology, 
e.g. as seen in picornaviruses and retroviruses like HIV. 
Therefore, essential viral proteinases have been suggested 
as potential targets for specific therapeutic approaches, 
e.g. by development of specific proteinase inhibitors [6- 
8 ]. 

In the case of picornaviruses, virus-encoded proteinases 
are able to cleave specific cellular targets and thereby 
severely inhibit the cellular translational machinery (the 
"host cell shut-off" response) while still allowing for high 
translational activity of viral mRNA [9]. Earlier, we devel¬ 
oped a computational approach for predicting potential 
cleavage sites of picornavirus proteinases 2A and 3C [10]. 
Badorff et al. successfully used this cleavage predictor to 
identify the cellular target dystrophin, which they experi¬ 
mentally showed to be cleaved both in vitro and in vivo 
in]. However, preliminary studies revealed that this 
model is not compatible with coronavirus cleavage sites. 
The general approach is still valid though, and we decided 
to apply this method to the problem of predicting the 
3CLP™ proteinase cleavage sites and identifying potential 
host cell target proteins. We propose that a deeper under¬ 
standing of coronavirus proteinase function and substrate 
specificity may benefit further research by: i) increasing 
the understanding of substrate specificity determinants 
which may direct studies focusing on the development of 
specific proteinase inhibitors and ii) providing a method 
for screening cellular target proteins for potential corona¬ 
virus proteinase cleavage sites. 

In this paper, we describe the development of a computa¬ 
tional prediction method using artificial neural networks 
for predicting coronavirus 3CLP™ proteinase cleavage sites. 
The method is based on known cleavage sites in seven 
members of the coronavirus family as the cleavage sites 
are believed to be sufficiently conserved among family 
members. This notion is supported by the fact that the 
SARS 3CLP™ proteinase has recently been shown capable 
of catalysing the cleavage of peptide fragments from other 
coronaviruses at the expected cleavage sites [12]. 

We discuss potential targets of 3CLP™ proteinase, e.g. the 
cystic fibrosis transmembrane conductance regulator 
(CFTR) and translational and transcriptional factors, 
which may be involved in the molecular pathology of 
coronaviruses in general and SARS virus in particular. 


Results 

Analysis of the proteinase cleavage site 

The 77 annotated coronavirus polyprotein main protein¬ 
ase cleavage sites were aligned without gaps by constrain¬ 
ing the PI position. Every site had a glutamine (Q) in 
position PI (the position just before the cleavage site; the 
positions are named as suggested by Berger and Schechter 
[13] with PI, P2, ... etc., N-terminal to the cleavage site 
and PI', P2', ... etc., C-terminal to the cleavage site). Prom 
the sequence logo (figure 1) a very strong consensus is 
evident around the cleavage site. As discussed by others 
[14,15], the coronavirus 3C-like proteinase shares many 
traits with its picornavirus 3C proteinase counterpart, 
hence the name. This is reflected in the cleavage site logo 
although differences between the two are also apparent. 
Positions PI', PI, and P4 have similar amino acid distri¬ 
bution in the 3C and 3CL proteinase cleavage sites. On the 
other hand, the coronavirus proteinase has a strong pref¬ 
erence for leucine at position P2 while this position is rel¬ 
atively non-conserved among picornavirus proteinase 
cleavage sites [10]. A recently published study of the crys¬ 
tal structure of 3 CLP™ from the 229E strain of human coro¬ 
naviruses indicates that residues at positions P5 to P3 
form an anti-parallel [3 sheet with part of the proteinase, 
signifying their importance in cleavage site recognition 
[7]. 

It is clear from the above that a simple, position specific 
consensus sequence is difficult to define. With the present 
data set from seven different coronaviruses it is possible to 
classify correctly 60 (78%) of the 77 cleavage sites by 
matching an 'LQ' consensus pattern. However, an addi¬ 
tional 196 sites in the viral polyproteins are incorrectly 
classified as cleavage sites, being random occurrences of 
this pair of amino acids. Classification is improved by 
using the consensus pattern 'LQ [S/A]', meaning Leu-Gln- 
(Ser OR Ala), but it is still far from being a useful classifier. 
The false positive rate is now down to 36 wrong sites, but 
at the same time only 48 (62%) of the correct cleavage 
sites are detected. As the pattern becomes more sophisti¬ 
cated, specificity increases (reducing the number of false 
positives) but at the same time sensitivity drops dramati¬ 
cally (i.e. fewer of the true sites are detected). 

Neural network training and performance 

To overcome the limitations of simple consensus patterns, 
we trained an artificial neural network to identify the 
cleavage sites. The best model was obtained using a three¬ 
layered neural network with two hidden neurons and a 
sequence window encompassing nine amino acids cen¬ 
tered on the PI position, thus encompassing P5-P4'. The 
network evaluates and assigns a score between 0 and 1 to 
every glutamine to which it is presented, where a score 
above 0.5 is considered a positive answer (i.e. a cleavage 
site is predicted). This model was able to classify correctly 
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Figure I 

Logo plot of a multiple alignment of 77 coronavirus 
cleavage sites. The height of the letters reflects the Shan¬ 
non information at individual positions (see Methods section 
for detailed information). 


67 of 77 known cleavage sites (87.0%) and 1,358 of 1,372 
(99.0%) sites assumed not to be cleaved by the proteinase 
when testing on independent sites not included when 
training. The neural network method could thus identify 
many more of the positive sites with fewer false positives 
than simple consensus-type methods thereby increasing 
the classification performance. The Matthews correlation 
coefficient reached 0.84 for the artificial neural network 
compared to 0.37, 0.53 and 0.51 for increasingly complex 
consensus patterns ('LQ', 'LQ [S/A]', ' [T/S/AJX [L/F]Q [S/ 
A/G]' respectively) (Figure 2). 

To evaluate the predictive power of the neural network, 
we performed a basic bayesian analysis of the data set test 
results. The scoring range from 0 to 1 was divided into ten 
bins and the posterior probability of a positive prediction 
(a prediction indicating a cleavage) being true was calcu¬ 
lated and plotted (Figure 3). The posterior probability in 
the range 0.5 to 0.8 cannot be determined accurately since 
relatively few examples score in this interval - only 3% of 
the test set (both positive and negative examples) scores 
between 0.4 and 0.8. However, results indicate that pre¬ 
diction scores can be classified into three categories, those 
that fall below 0.5 are most likely not cleaved, those that 
fall between 0.5 and 0.8 are possibly cleaved and those 
above 0.8 are most likely cleaved if available to the 
proteinase. 

Analysis of selected human proteins 

As mentioned above, there are several experimentally ver¬ 
ified examples of host cell protein cleavage by virus pro- 
teinases. Thus, both these and other non-coronavirus 
proteins from Swiss-Prot [16] 41.0 were examined for 
potential cleavage sites. In total three groups of proteins 


were examined: i) proteins known to be cleaved by other 
viruses, ii) proteins which could be targets when consider¬ 
ing the pathology of coronaviruses iii) proteins related to 
the expected immune response to a viral infection. 
Eukaryotic translation initiation factor 4 gamma 
(IF4G_HUMAN) has a potential cleavage site after 
Gln838 (0.822), but also at two other positions although 
with lower cleavage scores. Cleavage of this protein may 
lead to host cell shut-off in a similar way to what has been 
described for picornavirus 2A proteinase [17]. 

Two subunits of the RNA polymerase III are predicted tar¬ 
gets of the coronavirus proteinase 3 CL P ro . RNA polymerase 
(RPC1_HUMAN) has a predicted cleavage site after 
Glnl95 with a score (0.765) well above the 0.5 cut-off. The 
protein is the second largest subunit of the RNA 
polymerase III complex and if this protein is indeed a cel¬ 
lular proteinase target it might cause disruption of the 
RNA polymerase III complex upon infection with a coro¬ 
navirus. A similar disruption would be expected in case of 
a cleavage of the largest subunit of the complex 
(RPA1_HUMAN) which also has a predicted cleavage site 
(at position 329, score 0.704). It agrees with findings that 
poliovirus disrupts RNA polymerase III function, 
although this occurs through cleavage of transcription fac¬ 
tor IIIC and not the polymerase subunits themselves [18- 
20]. Several well-known transcription factors contain 
potential cleavage sites. The highest scoring is CREB-RP 
(AT6B_HUMAN) with a predicted cleavage site at Gln358 
(0.916) close to the DNA binding leucine zipper motif. 
This is in agreement with findings from picornavirus 3CP ro 
proteinase although at a different position in the sequence 
[21]. OCT-1 (P021_HUMAN) is also predicted to be 
cleaved by the 3 CL P ro proteinase with high confidence 
(0.874) following Gln62 again corresponding to experi¬ 
mental evidence from picornavirus [22]. Several subunits 
of the transcription initiation factor TFIID, which is a ver¬ 
ified target in poliovirus infections [23], have predicted 
cleavage sites; the 250 kDa subunit (T2D1_HUMAN), the 
135 kDa subunit (T2D3_HUMAN), and the 105 kDa sub¬ 
unit (T2DT_HUMAN). 

The tumor-suppressor protein P53 is known to be cleaved 
by picornavirus 3CP ro proteinase [24] but this protein is 
not predicted to contain any coronavirus 3CP ro proteinase 
cleavage sites. However, P5 3-binding protein 1 
(P531_HUMAN) and P53-binding protein 2 
(P532_HUMAN), which stimulate p53-mediated tran¬ 
scriptional activation [25], have several potential cleavage 
sites. 

Another known target for viral infections is the microtu¬ 
bule-associated protein 4 (MAP-4) which is cleavable in 
HeLa cells by the poliovirus 3CP ro proteinase [26,27]. 
MAP-4 (MAP4_HUMAN) might also be cleavable by 
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Figure 2 

Method performance comparison. Using consensus patterns or neural network (NN) to identify cleavage sites. Green 
bars are percentage of true positives, red bars are percentage of true negatives and blue bars are Matthews correlation coeffi¬ 
cients multiplied with 100. 


3CLP™ albeit with a low score (after Gin 1005 with a score 
of 0.519) and furthermore microtubule-associated pro¬ 
tein RP/EB member 1 and 3 (MAE1_HUMAN and 
MAE3_HUMAN) have sites which obtain scores above 
0.5. The position of the possible cleavage site in MAP-4 is 
different from that observed with poliovirus 3C P ro reflect¬ 
ing the different specificity of this proteinase. 

Lung related proteins were examined as early symptoms 
of SARS could indicate a relation. The cystic fibrosis trans¬ 
membrane conductance regulator (CFTR_HUMAN) is an 
ATP-dependent chloride channel. It has a predicted cleav¬ 
age site with a high score (0.842) following Gln762 in the 
human sequence. This part of the membrane protein is 


cytoplasmic and contains several phosphorylation sites 
(residues 660 - 813) indicating an accessible region. 

The epithelial sodium channels play an important role in 
lung liquid homeostasis [28] and the amiloride-sensitive 
sodium channel 5-subunit (SCAD_HUMAN) has a pre¬ 
dicted cleavage site in the cytoplasmic C-terminus (after 
residue 22) scoring 0.828. A number of proteins involved 
in the ubiquitin pathway which targets proteins to the 
proteasome, a necessary step to generate an immune 
response, have predicted cleavage sites (Swiss-Prot entries 
UBP1_HUMAN, SOC6_HUMAN, UBPD_HUMAN, 
UBP4_HUMAN, UBP5_HUMAN, UBPQ_HUMAN, 
FAFY_HUMAN, FAFX_HUMAN ). Cleavage of one or 


Page 4 of 9 

(page number not for citation purposes) 



























BMC Bioinformatics 2004, 5 


http://www.biomedcentral.eom/1471-2105/5/72 



Scoring range 


Figure 3 

Reliability analysis of data set test results. Scoring range (0-1) was divided into ten bins. The fraction of negative exam¬ 
ples in each bin is illustrated with red bars, the fraction of positive examples is illustrated with green bars, blue bars are poste¬ 
rior probabilities of a true cleavage prediction (see Methods section for detailed information). 


more of these proteins may lead to reduced presentation 
of viral peptides to cytotoxic T lymphocytes thereby 
inhibiting the cellular immune response. IRAK-1 
(IRA1_HUMAN) which is involved in IL-1 induced activa¬ 
tion of cells has a predicted cleavage site after Gln457 
scoring 0.859. 

Interferon-induced protein 6-16 precursor 
(INI2_HUMAN) is a membrane protein and was pre¬ 
dicted to possess a cleavage site following Gln97 (0.890) 
which is located in the cytoplasmic part of the mature pro¬ 
tein. Protein 6-16 has been shown to enhance interferon- 
a antiviral efficacy [29]. Interferon-a, -[3, and -y are known 
to be involved in antiviral defence and have been 
employed for treatment of SARS [30], but the interferons 
themselves do not seem to possess cleavage sites. 

We have listed the human proteins analysed in this work 
in a table (Table 1). 


Discussion 

We have developed a neural network capable of identify¬ 
ing the cleavage site of the coronavirus proteinase 3CLP ro 
and use this model to predict potential cleavage sites in 
host cell proteins. The predictor is highly specific which 
means that few false positives are expected, in fact on 
independent test sets we observed a false positive rate 
around 1%. The optimal network window size of nine res¬ 
idues agrees well with available structural information 
about the proteinase from human coronavirus 229E 
which indicates that the active site makes contact with at 
least four residues N-terminal to the glutamine [7]. 

The ten sites known to be cleaved but failed to be recog¬ 
nised by the neural network are not dramatically different 
from the remainder of the sites (Table 2). We therefore do 
not suspect these to be sites of a different hitherto 
unknown proteinase, but it would be interesting to see if 
the lower prediction score reflects a lower cleavage effi¬ 
ciency in vivo. Of the fourteen negative examples wrongly 
predicted as cleavable (Table 3), the highest scoring were 
examined more closely. The selected examples all show 
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Table I: Selected potential cleavage sites in human proteins from the Swiss-Prot database examined in this work. Columns represent 
Swiss-Prot identifier, predicted cleavage site position of PI in the target protein, cleavage site score, and cellular localisation of target 
protein (Cyt - cytoplasmic, Nuc - nuclear, Mem - membrane associated). The last column lists the cleavage site in the sequence - 
cleavage is predicted between the central glutamine residue (Q) and the following amino acid residue. Sorted by prediction score. 


Swiss-Prot ID 

Loc 

Position 

Score 

Sequence 

AT6B_HUMAN 

Nuc 

358 

0.916 

EARLQAVLAD 

INI2_HUMAN 

Mem 

97 

0.890 

VATLQSLGAG 

P02I_HUMAN 

Nuc 

62 

0.874 

GTSLQAAAQS 

IRAIJHUMAN 

Cyt 

457 

0.859 

QSTLQAGLAA 

CFTRJHUMAN 

Mem 

762 

0.842 

GPTLQARRRQ 

SCAD JHUMAN 

Mem 

22 

0.828 

GSHLQAAAQT 

P532JHUMAN 

Nuc 

308 

0.782 

ASVPQSTGNA 

RPC 1 JHUMAN 

Cyt 

195 

0.765 

SNFLQSFETA 

P531 JHUMAN 

Nuc 

196 

0.738 

KEQLQSVTTN 

T2DI JHUMAN 

Nuc 

741 

0.730 

GQLLQAFENN 

P532JHUMAN 

Nuc 

197 

0.725 

KAALQQKENL 

RPAI JHUMAN 

Cyt 

329 

0.704 

TVNLQAVMKD 

CFTRJHUMAN 

Mem 

958 

0.693 

HSVLQAPMST 

MAE 1 JHUMAN 

Cyt 

64 

0.661 

KVKFQAKLEH 

MAE3 JHUMAN 

Cyt 

64 

0.661 

KVKFQAKLEH 

P531 JHUMAN 

Nuc 

410 

0.660 

QKKLQSGEPV 

CFTRJHUMAN 

Mem 

890 

0.654 

NTPLQDKGNS 

P532JHUMAN 

Nuc 

722 

0.624 

SPNLQNNPEE 

T2DT JHUMAN 

Nuc 

133 

0.619 

PSSVQSVAVP 

T2D3JHUMAN 

Nuc 

610 

0.570 

SSGKQSTETA 

MAP4 JHUMAN 

Cyt 

1005 

0.519 

YSHIQSKCGS 


Table 2: Known main proteinase cleavage sites in coronavirus polyproteins used in this study, which were missed by the neural network 
during cross-validation. Position refers to position in the viral polyprotein. The last column lists the cleavage site in the sequence - 
cleavage occurs between the central glutamine residue (Q) and the following amino acid residue. 


Accession 

Position 

Virus 

Sequence 

NC_00I451 

3928 

AIBV 

KSSVQSVAG 

NC_001846 

3923 

MHV 

VSQIQSRLT 

NC_001846 

5984 

MHV 

NPRLQCTTN 

NC_002306 

5527 

TGV 

KIGLQAKPE 

NC_003045 

5900 

BCoV 

ETRVQCSTN 

NC_003436 

3299 

PEDV 

GVNLQGGYV 

NC_003436 

6141 

PEDV 

SNNLQGLEN 

NC_004718 

3546 

SARS 

GVTFQGKFK 

NC_004718 

4369 

SARS 

EPLMQSADA 

NC_0047I8 

5902 

SARS 

VATLQAENV 


some resemblance to real cleavage sites but also some 
resemblance to negative examples which are not predicted 
as cleavable. They may represent sites in-between which 
are cleavable to a certain extent but are shielded from 
cleavage due to conformational issues. 

Predicted sites even with high scores which are inaccessi¬ 
ble to the proteinase (like extracellular domains, 
transmembrane domains, or buried domains in globular 


proteins) should be disregarded, as accessibility informa¬ 
tion is not available to the neural network. Cleavage sites 
probably exist that are not cleaved because they are not 
exposed to the solvent sufficiently for the proteinase to 
work. 

Others have attempted recognising the cleavage sites of 
the 3CL proteinase as a component of a coronavirus gene 
prediction server using different methods [31 ]. As the goal 
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Table 3: Negative examples predicted to be cleaved by the neural network during cross-validation. Position refers to position in the viral 
polyprotein. The last column lists the cleavage site in the sequence - cleavage is predicted between the central glutamine residue (Q) 
and the following amino acid residue. 


Accession 

Position 

Virus 

Sequence 

NC_001846 

3607 

MHV 

HSGFQGKQI 

NC_001846 

6613 

MHV 

YTDLQCIES 

NC_002306 

1457 

TGV 

ETSLQCLLK 

NC_002306 

5747 

TGV 

YSSSQSVYA 

NC_002306 

698 

TGV 

ETNIQAIKN 

NC_002306 

85 

TGV 

SVMLQGFIV 

NC_002645 

1 169 

HCoV-229E 

IRQLQGTII 

NC_002645 

2659 

HCoV-229E 

YSSIQANAY 

NC_002645 

322 

HCoV-229E 

VIALQSVDC 

NC_003045 

1364 

BCoV 

DARTQGKQS 

NC_003045 

1498 

BCoV 

RTFVQSNVD 

NC_003045 

2713 

BCoV 

SSDFQHKLK 

NC_003045 

311 

BCoV 

VMRLQSAST 

NC_003436 

1751 

PEDV 

SAGLQAMWE 


was different, that predictor is not publicly available and 
no performance values have been published. 

Conclusions 

Our method can be employed by researchers suspecting a 
possible viral proteinase cleavage but may also prove use¬ 
ful for researchers working with coronavirus function. 
Finally, the method might facilitate proteinase blocking 
based drug discovery by providing hints about proteinase 
affinity to various non-cleavable peptide ligands, which is 
a possible strategy for drug development [7,32]. 

Methods 

Data Set Preparation 

Seven full-length coronavirus genomes were retrieved 
from the GenBank database [33] with the following acces¬ 
sion numbers: NC_001451 (Avian infectious bronchitis 
virus, AIBV), NC_001846 (Murine hepatitis virus, MHV), 
NC_002645 (Human coronavirus 229E, HCoV-229E), 
NC_003436 (Porcine epidemic diarrhea virus, PEDV), 
NC_003045 (Bovine coronavirus, BCoV), NC_002306 
(Transmissible gastroenteritis virus, TGV), and the TOR2 
strain of SARS NC_004718. Deduced polyprotein 
sequences were aligned and cleavage sites identified from 
the annotation in NC_004718. Each sequence contained 
eleven 3 CLP™ proteinase cleavage sites, thus a total of 77 
of these sites were identified. For training a neural net¬ 
work classifier, a number of negative examples (presumed 
non-cleavage sites) are required. For this purpose, all 
other glutamines in the viral polyproteins were treated as 
non-cleavable sites. 

Three test sets were created for three-fold cross-validation 
and the training set for one was created by combining the 


two other test sets. Every test set thus contained 483 exam¬ 
ples of which 25 or 26 were positive examples. All testing 
and results reported are combined values of the three test 
sets, which are run individually with three separate neural 
networks to avoid testing on sequences included in train¬ 
ing sets. 

Sequence logos 

Amino acid conservation in multiple sequence align¬ 
ments may be visualised using sequence logos. The height 
of the amino acid one-letter abbreviations reflect the 
Shannon information content [34] in units of bits at that 
specific position in the multiple sequence alignment [35]. 
The basic idea behind the visualisation technique is that 
the height of each letter in a given position reflects its 
probability p k (i ). The total height of the column reflects 
the total information content (D(i)) at that specific posi¬ 
tion in the alignment given by (for proteins): 

20 

D(i) = log 2 20+52 Pki}) lo g 2 Phi 0 

k=l 

Very conserved positions will then get tall columns with 
the height of individual residue symbols reflecting the 
amino acid distribution. 

Training the neural networks 

The artificial neural networks used in this work were of the 
standard feed-forward type. Sparse encoding was used for 
translating the amino acids to data input for the networks 
as has been described previously [36,10,37]. 

Training was done with three-fold cross-validation and 
Matthews correlation coefficients [38] were calculated by 
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summing up true positives, false positives, true negatives, 
and false negatives in all combinations of training and test 
sets. Using an architecture with two hidden neurons and 
a symmetric window of nine amino acids centered on the 
glutamine in the PI position it was possible to obtain a 
correlation coefficient of 0.84 on cross-validated test sets. 
Care was taken to ensure that all cleavage sites were 
equally distributed in every cross-validated set. 

Bayesian statistics 

The validity of the statistics depends on the expected frac¬ 
tion of cleavage sites in a given data set, which we only 
know in the data set at hand. Statistics was thus done on 
the data set test results in order to create a histogram of 
prediction probabilities. Statistics was done using Bayes' 
Theorem: 


; P(X ; |C pos ) P(C pos ) 

P(X l 

The prediction outcome (0-1) was divided into 10 bins 
(X 1 ) with increments of 0.1. The posterior probability 
P(C pos \X l ) gives the probability of a positive prediction 
(that is, a cleavage) being true given the bin. This can be 
calculated from the prior probability P(C pos ), which is the 
fraction of positive examples in the data set, and the class- 
conditional probability P(X l \C pos ) for positive examples, 
which is the fraction of positive examples in the bin X 1 . 
P(X l ) is the fraction of prediction outcomes in bin X 1 . 

Searching for potential cleavage sites 

An averaged sum of the score of all three networks arising 
from the three-fold cross-validation was used for predic¬ 
tion. Each network outputs a score in the range [0.000- 
1.000], where scores below 0.5 indicate non-cleavage and 
scores above 0.5 indicate potential cleavage. This method 
is also employed by the prediction web server mentioned 
below. The Swiss-Prot database [16] release 41.0 (Febru¬ 
ary 2003) was downloaded and proteins from this 
database were used as targets for the neural network 
predictions. 

Availability 

Our neural network based prediction method, NetCo- 
rona, for prediction of potential cleavage sites of the SARS- 
3 CU ro proteinase is publicly available by following the 
link 'CBS prediction servers' from http://www.cbs.dtu.dk 
or at this specific URL: http://www.cbs.dtu.dk/services/ 
NetCorona/ 
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